With the ultimate goal of training a BERT text classifier to identify the nationality/L1 of non-native writers of English, this project covers the following:
~25 hours
Corpora included:
Access Pending:
The following code extracts samples from each corpus, and unifies the labels and samples into a single dataset. Brief descriptions of each corpus are also provided.
%%HTML
<script src="require.js"></script>
import os
import re
#plotting
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import plotly
import plotly.io as pio
pio.renderers.default='notebook'
#data handling
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import bs4
from bs4 import BeautifulSoup
pd.set_option('display.max_rows', 75)
pd.set_option('display.max_columns', 10)
# main directories
project_dir = "/Users/paulp/Desktop/UEF/Thesis"
corpus_dir = os.path.join(project_dir, 'Data')
# relative corpus directories
ICLE_dir = os.path.join(corpus_dir, "ICLE/split_texts")
EFCAMDAT_dir = os.path.join(corpus_dir, 'EFCAMDAT')
LANG8 = os.path.join(corpus_dir, 'NAIST_LANG8/lang-8-20111007-2.0/lang-8-20111007-L1-v2.dat')
PELIC = os.path.join(corpus_dir, 'PELIC/PELIC_compiled.csv')
os.chdir(corpus_dir)
https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html
Version 2 of the International Corpus of Learner English from UC Louvain. Samples adhere closely to Atkins and Clear's (1992) corpus design criteria [ICLE]. Most samples in ICLE are argumentative essays collected from academic environments, representing a range of suggested topics.
The data available to UEF users in V2 does not represent the full range of L1/nationalities of interest. This will be addressed further.
files = os.scandir(ICLE_dir)
nationalities = {}
for file in files:
    nationality = re.split('-', file.name)[1]
    nationalities[nationality] = nationalities.get(nationality, 0) + 1
nationalities
{'GE': 281,
'CN': 757,
'JP': 365,
'SW': 255,
'PO': 350,
'FIN': 193,
'TR': 255,
'RU': 250,
'SP': 186}
dataset = pd.DataFrame(data = None, columns = ['Corpus','Target','Text'])
# fill dataframe with samples
files = os.scandir(ICLE_dir)
for i, file in enumerate(files):
    target = re.split('-', file.name)[1]
    with open(file) as f:
        text = f.read()
    dataset.loc[i, 'Target'] = target
    dataset.loc[i, 'Text'] = text
    dataset.loc[i, 'Corpus'] = 'ICLE'
# Remove Swedish, Polish, and Finnish (data too sparse)
dataset = dataset[dataset['Target'] != 'SW']
dataset = dataset[dataset['Target'] != 'PO']
dataset = dataset[dataset['Target'] != 'FIN']
len(dataset)
2094
https://philarion.mml.cam.ac.uk/
This corpus is a collaboration between EF Education First and the Department of Theoretical and Applied Linguistics at the University of Cambridge. The samples were collected from English Live, EF's online language school. Samples are sortable by nationality, level, and other provided variables. As in ICLE, nationality is assumed to correlate with L1.
At first, levels 10-16 were selected for this project; based on the corpus documentation, this corresponds to B2+ CEFR levels [], which is harmonious with the ICLE corpus. However, after this initial exploration, it seemed that the levels were inflated, perhaps because they represent overall English competence rather than being distinctly reflective of writing skills. Ultimately, levels 12-16 were selected to filter out some of the lower quality samples.
To address an under-representation of Spanish language data, Spanish was also sampled from a few Latin American countries. These varieties of Spanish may well impact the model's ability to pick up on 'general' characteristics of Spanish-influenced L2 English, but for now the increase in volume and balanced representation will be assumed a benefit rather than a drawback.
# Process the XML file from EFCAMDAT
efcamdat = os.path.join(EFCAMDAT_dir, 'EF201403_selection1854.xml')
with open(efcamdat) as fp:
    soup = BeautifulSoup(fp, features='lxml-xml')
# REMINDER: add Arabic, Korean, and Latin Spanish here
efcamdat_ds = pd.DataFrame(data=None, columns = ['Corpus', 'Target', 'Text'])
nationalities = {'cn':'CN',
'de':'GE',
'es':'SP',
'jp':'JP',
'ru':'RU',
'tr':'TR'}
# Build the DataFrame
for s in soup.find_all('writing'):
    level = int(s.get('level'))
    text = s.find_all('text')[0].text
    # filter out lower-level texts
    if level >= 12:
        nationality = s.find_all('learner')[0].get('nationality')
        if nationality in nationalities:
            d = pd.DataFrame(data={'Corpus': ['EFCAM'],
                                   'Target': [nationalities[nationality]],
                                   'Text': [text]})
            efcamdat_ds = pd.concat([efcamdat_ds, d])
data = pd.concat([dataset, efcamdat_ds])
data['Target'] = pd.Categorical(data['Target'])
data['Corpus'] = pd.Categorical(data['Corpus'])
data.describe()
| Corpus | Target | Text | |
|---|---|---|---|
| count | 10242 | 10242 | 10242 |
| unique | 2 | 6 | 10213 |
| top | EFCAM | GE | \n will be done shortly\n |
| freq | 8148 | 3889 | 6 |
https://eli-data-mining-group.github.io/Pitt-ELI-Corpus/
PELIC contains writing samples from students in the University of Pittsburgh English Language Institute, an intensive EAP program.
Because the data is longitudinal, only one writing sample per student was selected. This prevents the model from learning the characteristics of individual writers rather than those of the target group, although the number of samples per student is relatively small in relation to the corpus size. Levels 4-5, corresponding to B1+, were selected; this may later be narrowed to level 5 alone to better reflect the composition of the other corpora.
In the case of PELIC, L1 (not nationality) is the variable label. Provided that the documentation of ICLE and EFCAMDAT is correct, it is reasonable to fuse nationality and L1 into a single variable called 'Target' without significantly polluting it.
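The one-sample-per-student selection can be sketched with a pandas groupby. This is a toy illustration: the `anon_id` student-identifier column name is an assumption about the compiled file's schema.

```python
import pandas as pd

# toy stand-in for PELIC_compiled.csv; 'anon_id' as the student
# identifier is an assumption about the compiled file's schema
df = pd.DataFrame({'anon_id': ['s1', 's1', 's2', 's2', 's2', 's3'],
                   'level_id': [4, 5, 4, 4, 5, 5],
                   'text': ['a', 'b', 'c', 'd', 'e', 'f']})

# keep a single sample per student (here: the highest-level one)
one_per_student = (df.sort_values('level_id')
                     .groupby('anon_id', as_index=False)
                     .last())
```

Any deterministic rule (first, last, random with a fixed seed) works; the point is that each student contributes exactly one row.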
pelic_ds = pd.read_csv(PELIC)
pelic_nationality_map = {'Arabic':'AR',
'Korean':'KO',
'Chinese':'CN',
'Japanese':'JP',
'Spanish':'SP',
'Turkish':'TR',
'Russian':'RU',
'German':'GE'
}
# Filter by level and L1
reduced = pelic_ds.filter(items=['level_id', 'L1', 'text'])
reduced = reduced.query("level_id >= 4")
# get text and target, change target name
reduced = reduced.filter(items=['L1', 'text'])
reduced_pelic = reduced[reduced['L1'].isin(pelic_nationality_map)].copy()
# add corpus label and rename columns
reduced_pelic['Corpus'] = 'PELIC'
reduced_pelic = reduced_pelic.rename(columns={'L1':'Target', 'text':'Text'})
reduced_pelic['Target'] = reduced_pelic['Target'].apply(lambda row: pelic_nationality_map[row])
#append to main data
data = pd.concat([data, reduced_pelic])
data['Corpus'].value_counts()
PELIC    29142
EFCAM     8148
ICLE      2094
Name: Corpus, dtype: int64
~ 50 hours
Thus far, there are three corpora in the dataset with the number of samples noted above, but more detail about the nature and distribution of the samples is needed, along with insight as to how this may influence results and inform design. The code and visualizations below show:
Note that the zoom feature can be used to isolate specific distributions in the visualizations for more clarity.
Design-related questions are addressed both throughout and at the end of the section.
corpus_colors = {'ICLE': 'blue', 'EFCAM': 'green', 'PELIC': 'violet'}  # color_discrete_map expects category -> color
fig = px.bar(data,
x=data['Target'],
color=data['Corpus'],
opacity=0.8,
title = 'Number of Texts by Nationality Group',
color_discrete_map = corpus_colors)
fig.update_traces(dict(marker_line_width=0)) #run this line if the visualization looks cloudy
fig.show(renderer='notebook')
Note that Arabic and Korean will have data from EFCAMDAT added in the final version. Turkish may be dropped from the project if no other sources of data are found.
# Calculate and Append text lengths using BERT tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
data = data.reset_index(drop=True)
data['Length'] = data['Text'].apply(lambda x: len(tokenizer(x)['input_ids']))
Token indices sequence length is longer than the specified maximum sequence length for this model (557 > 512). Running this sequence through the model will result in indexing errors
fig = px.strip(data,
y="Length",
x="Target",
color="Corpus",
color_discrete_map = corpus_colors,
hover_data=None,
title='Distribution of Text Lengths'
)
fig.show()
# alternative visualization to the strip plot is the violin plot.
# zoom for more clarity.
fig = px.violin(data,
y="Length",
x="Target",
color="Corpus",
box = True,
points = None,
color_discrete_map = corpus_colors,
hover_data=None)
fig.show()
px.histogram(data,
x='Length',
color = 'Corpus',
color_discrete_map = corpus_colors,
range_x = [0,1500],
opacity=1.0,
title= 'Distribution of Text Lengths Overall'
)
Notice the many tiny samples with length <= 50 in EFCAMDAT and PELIC. These are mostly non-informative entries that indicate the task was beyond the students' abilities or they did not have time to complete the task. These are filtered out at a threshold of 170 tokens to make the training samples more informative and efficient.
This threshold was chosen to minimize the number of excluded samples while also making sure the samples are substantial and worth training on. More implications of sample length regarding BERT models will be mentioned later and discussed more fully in the next stage of the project.
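The trade-off can be inspected directly by checking how many samples fall below candidate thresholds. A minimal sketch with synthetic lengths (the real check would use the `Length` column computed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for the token-length column, for illustration only
lengths = rng.gamma(shape=2.0, scale=150.0, size=10_000).astype(int)

for threshold in (50, 100, 170, 250):
    excluded = (lengths < threshold).mean()
    print(f"threshold {threshold:>3}: {excluded:6.1%} of samples excluded")
```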
# trim below 170 tokens
data = data.query('Length >= 170')
px.histogram(data,
x='Length',
color = 'Corpus',
color_discrete_map = corpus_colors,
range_x = [170,1500],
opacity=1.0,
title = 'Frequency Distribution of Text Lengths'
)
px.histogram(data,
x='Length',
color = 'Corpus',
cumulative = True,
barmode = 'overlay',
histnorm = 'percent',
color_discrete_map = corpus_colors,
range_x = [170,1500],
opacity=0.4,
title = 'Cumulative Distribution of Text Lengths'
)
There are some data imbalance issues, namely that Turkish is underrepresented. One option would be to find data from a separate Turkish learner corpus for inclusion. As can be seen above, however, corpora can vary greatly in composition, quality, and length of samples. Introducing a corpus that represents only one target group might have confounding impact.
Another option is regularizing the model such that more prevalent target groups are not predicted arbitrarily: this approach 'punishes' the model for predicting German or Chinese or Arabic simply because they appear more frequently.
A third option would be to drop Turkish from the data entirely. This would have the benefit of simplifying the classification problem, which is already quite complex, although it underscores a criticism of big data approaches to low-resource languages: although these are the languages in need of more research, they tend to be left out of data-heavy studies. Although Turkish is not resource scarce, by comparison there is a lot less data at our disposal.
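The class-weighting option above can be sketched as inverse-frequency ('balanced') weights; the per-class counts below are illustrative, not the corpus's actual totals.

```python
from collections import Counter

# illustrative per-class sample counts (not the corpus's actual totals)
counts = Counter({'GE': 3889, 'CN': 2500, 'AR': 2000, 'TR': 300})
n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * class_count);
# rare classes get weights > 1, frequent classes < 1
weights = {c: n_samples / (n_classes * v) for c, v in counts.items()}
```

Such weights would then be handed to a weighted cross-entropy loss, so that mispredicting a rare class costs more than mispredicting a frequent one.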
A principal design decision in BERT models is setting the maximum sample length in number of tokens. Although this can hypothetically be set as high or low as desired, it comes at a performance cost. The standard medium-sized, pretrained BERT model has a max length of 512 tokens. If a training sample is shorter than the max length, mask tokens are passed to the model so that it ignores the empty positions at the end of the sample. If it is longer than the max length, it is truncated, and the end of the sample is lost.
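The pad-or-truncate behaviour can be sketched in a few lines (max_len is shrunk to 8 for readability; pad id 0 and the 0/1 attention-mask convention match BERT's defaults, and the token ids in the usage line are arbitrary):

```python
def pad_or_truncate(token_ids, max_len=8, pad_id=0):
    """Return (ids, attention_mask), each exactly max_len long."""
    token_ids = token_ids[:max_len]                      # anything past max_len is lost
    n_pad = max_len - len(token_ids)
    attention_mask = [1] * len(token_ids) + [0] * n_pad  # 0 = ignore this position
    return token_ids + [pad_id] * n_pad, attention_mask

ids, mask = pad_or_truncate([101, 7592, 2088, 102])  # a short sample gets padded
```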
Doubling the max length (at least) quadruples the computational cost, as attention weights have to be calculated for each pair of tokens. My machine can handle max_len = 1024, although a single training epoch takes about two hours. A max length of 256 trains faster, but clips quite a bit off of longer samples, leading to massive data loss. This decision will be explored in more detail at the next stage of the project.
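The quadratic scaling can be checked with simple arithmetic: self-attention computes one score per ordered token pair, so doubling the sequence length quadruples that pairwise work.

```python
# pairwise attention scores grow with the square of the sequence length
for max_len in (256, 512, 1024):
    pairs = max_len ** 2
    print(f"max_len={max_len:>4}: {pairs:>9,} scores "
          f"({pairs / 512 ** 2:.2f}x the 512-token baseline)")
```

Going from 512 to 1024 thus multiplies the attention-score count by four (262,144 to 1,048,576).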
data_multi = data.set_index(['Target', 'Corpus']).sort_index()
data_multi.loc[('CN', 'PELIC'), :]
| Text | Length | ||
|---|---|---|---|
| Target | Corpus | ||
| CN | PELIC | In Taiwan, we have a proverb, "Far relative ca... | 429 |
| PELIC | Some people said, "Not all learning takes plac... | 363 | |
| PELIC | There have been lots of debates on the issue t... | 278 | |
| PELIC | Each person has a dream; so when he realizes h... | 261 | |
| PELIC | In 2001, I met my good friend Bingbing while s... | 187 | |
| ... | ... | ... | |
| PELIC | When I was a child, my parents were too busy t... | 245 | |
| PELIC | Legalize Marijuana (Sherry, 4P, 07/23/2012)\nI... | 608 | |
| PELIC | Recently, I have watched a comedy starring Mar... | 200 | |
| PELIC | Summer vacation\n Most children like their sum... | 673 | |
| PELIC | Intergenerational Housing\nThe housing market ... | 606 |
1640 rows × 2 columns
Learners of different nationalities often write about the places, people, and organizations that they know: if certain tokens ('China' or 'Islam' for example) occur disproportionately in one target group, the model will likely use these as a basis of its decision making rather than looking at the structure of the text.
To test informally whether this hypothesis has any merit, we perform NER over the corpus using Stanza, and then compare the results to some measures of dispersion to gauge any correlation. Note that Stanza is trained on very clean data, and (as was shown in a previous notebook) does not perform as well on so-called 'noisy' data in which there are mistakes.
"Standard deviation is a useful measure when we want to see how homogeneous or heterogeneous the distribution of a word is." (Brezina, 2018, p. 50)
Finding the SD and then the Coefficient of Variation (CV) for each token across the target groups, we can determine the imbalanced and frequent tokens which are most likely to make the classification task too easy for BERT.
Here I explore coefficient of variation and the dispersion of proportions. For DP, values closer to 1 indicate an uneven distribution, while near-zero values are more even. CV can take values above 1, but higher values also mean uneven distribution.
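A toy example illustrates how the two measures separate even from uneven distributions. This is a simplified stand-in, not the project's own functions: it assumes equally sized corpus parts, so the expected proportion is simply 1/k.

```python
import numpy as np

def dp(counts):
    """Gries's Deviation of Proportions, assuming equally sized corpus parts."""
    counts = np.asarray(counts, dtype=float)
    observed = counts / counts.sum()   # how the token's occurrences are spread
    expected = 1.0 / len(counts)       # even share per part
    return np.abs(observed - expected).sum() / 2

def cv(counts):
    """Coefficient of variation: population SD over the mean."""
    counts = np.asarray(counts, dtype=float)
    return counts.std() / counts.mean()

uneven = [8, 1, 1]   # token concentrated in one target group
even = [4, 3, 3]     # roughly even spread
print(f"uneven: DP={dp(uneven):.2f}, CV={cv(uneven):.2f}")  # both high
print(f"even:   DP={dp(even):.2f}, CV={cv(even):.2f}")      # both much lower
```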
import stanza
processors = {'tokenize':'ewt','ner':'conll03'}
ner = stanza.Pipeline('en', processors=processors, package='ewt')
2022-08-04 10:00:34 WARNING: Language en package ewt expects mwt, which has been added
2022-08-04 10:01:23 INFO: Loading these models for language: en (English):
=======================
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| mwt       | ewt     |
| pos       | ewt     |
| lemma     | ewt     |
| depparse  | ewt     |
| ner       | conll03 |
=======================
2022-08-04 10:01:23 INFO: Use device: cpu
2022-08-04 10:01:24 INFO: Done loading processors!
#Generate a rough list of tokens which are parts of named entities across the corpus
#this takes a while to run
NEs = {}
for text in data['Text']:
    doc = ner(text).to_dict()
    for sentence in doc:
        for token in sentence:  # check every token, not just the first in each sentence
            if token['ner'] != 'O':
                NEs[token['text']] = token['ner']
data = data.reset_index(drop=True)
# Generate a Labeled token list
iterables = [["Target"], data['Target'].unique()]
col = pd.MultiIndex.from_product(iterables, names=["1", "2"])
frequency_list = pd.DataFrame(data = None,
columns = col)
for a in data.index:
    tgt = data.loc[a, 'Target']
    ts = tokenizer.tokenize(data.loc[a, 'Text'])
    for t in ts:
        if t not in frequency_list.index:
            frequency_list.loc[t] = 0
        frequency_list.loc[t, tgt] += 1
frequency_list['Total'] = frequency_list.sum(axis=1)
NE = []
for a in frequency_list.index:
    if a in NEs:
        NE.append(NEs[a][2:])  # strip the B-/I-/S-/E- prefix from the entity tag
    else:
        NE.append('O')
frequency_list['NE'] = NE
frequency_list.sort_values(by=['Total'], ascending=False)
| first | Target | Total | NE | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| second | GE | CN | JP | TR | RU | SP | AR | KO | ||
| . | 39561 | 67222 | 35120 | 18624 | 29502 | 17072 | 48249 | 39257 | 294607 | O |
| , | 25677 | 57087 | 27070 | 12838 | 21833 | 17914 | 39775 | 33887 | 236081 | O |
| the | 31657 | 50758 | 21134 | 13384 | 20085 | 16743 | 38137 | 20279 | 212177 | O |
| to | 21768 | 34854 | 17017 | 9065 | 15284 | 10733 | 25860 | 17863 | 152444 | O |
| and | 18306 | 26442 | 12200 | 7403 | 13965 | 9041 | 23550 | 13696 | 124603 | O |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Tier | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | O |
| ##enbach | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | O |
| Alexandre | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | O |
| anthropology | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | O |
| ##sket | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | O |
23317 rows × 10 columns
"DP (Deviation of Proportions) is a measure proposed by Gries (2008) which compares the expected distribution of a word or phrase in different corpus parts with the actual distribution." (Brezina, 2018, p. 52)
"The coefficient of variation is a standardized measure; this means that it can be compared across different words and phrases in one corpus. The closer the coefficient is to zero, the more even the distribution of the word or phrase is." (Brezina, 2018, p. 50)
def get_DP(df, col):
    # expected proportion of each target group (its share of all tokens)
    exp_prop = df[col].sum(axis=0) / df[col].sum(axis=0).sum()
    # observed proportion of each token's occurrences per group
    obs_prop = df[col].div(df[col].sum(axis=1), axis=0)
    # DP is half the sum of absolute differences; take abs() before summing,
    # otherwise the signed differences cancel to zero
    DP = obs_prop.sub(exp_prop).abs().sum(axis=1) / 2
    return DP
def get_SD(df, col):
    cats = df[col].shape[1]
    mean = df[col].sum(axis=1) / cats
    sos = df[col].sub(mean, axis=0).pow(2).sum(axis=1)
    SD = sos.div(cats).pow(0.5)
    return SD
def get_CV(df, col):
    sd = get_SD(df, col)
    mean = df[col].sum(axis=1) / df[col].shape[1]  # compute the category count here rather than relying on get_SD's local variable
    CV = sd.div(mean)
    return CV
f2 = frequency_list.assign(DP = get_DP(frequency_list, 'Target'),
SD = get_SD(frequency_list, 'Target'),
CV = get_CV(frequency_list, 'Target')
).sort_values('SD', ascending=False).reset_index()
f2.to_csv(os.path.join(corpus_dir, 'frequency_dist_2.csv')) #save
f2 # frequency list with some common dispersion measures added
| first | index | Target | Total | NE | DP | SD | CV | SD_log | CV_exp | Total_log | Mask | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| second | GE | CN | JP | TR | RU | SP | AR | KO | ||||||||||
| 0 | . | 39561 | 67222 | 35120 | 18624 | 29502 | 17072 | 48249 | 39257 | 294607 | O | 1.040834e-17 | 15189.949238 | 0.412480 | 9.628389 | 1.510560 | 12.593398 | 0 |
| 1 | , | 25677 | 57087 | 27070 | 12838 | 21833 | 17914 | 39775 | 33887 | 236081 | O | 1.040834e-17 | 13119.755623 | 0.444585 | 9.481874 | 1.559843 | 12.371930 | 0 |
| 2 | the | 31657 | 50758 | 21134 | 13384 | 20085 | 16743 | 38137 | 20279 | 212177 | O | 0.000000e+00 | 11865.583598 | 0.447384 | 9.381397 | 1.564215 | 12.265176 | 0 |
| 3 | to | 21768 | 34854 | 17017 | 9065 | 15284 | 10733 | 25860 | 17863 | 152444 | O | 0.000000e+00 | 7843.208256 | 0.411598 | 8.967403 | 1.509228 | 11.934553 | 0 |
| 4 | and | 18306 | 26442 | 12200 | 7403 | 13965 | 9041 | 23550 | 13696 | 124603 | O | 3.469447e-18 | 6286.282485 | 0.403604 | 8.746125 | 1.497211 | 11.732888 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23312 | Armstrong | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | O | 6.938894e-18 | 0.330719 | 2.645751 | -1.106486 | 14.094030 | 0.000000 | 0 |
| 23313 | orbit | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | O | 6.938894e-18 | 0.330719 | 2.645751 | -1.106486 | 14.094030 | 0.000000 | 0 |
| 23314 | Thames | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | O | 6.938894e-18 | 0.330719 | 2.645751 | -1.106486 | 14.094030 | 0.000000 | 0 |
| 23315 | loosened | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | O | 6.245005e-17 | 0.330719 | 2.645751 | -1.106486 | 14.094030 | 0.000000 | 0 |
| 23316 | ##sket | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | O | 6.938894e-18 | 0.330719 | 2.645751 | -1.106486 | 14.094030 | 0.000000 | 0 |
23317 rows × 18 columns
# transform the statistics so they are clearer to visualize
f2['SD_log'] = f2['SD'].transform(np.log)
f2['CV_exp'] = f2['CV'].transform(np.exp)
f2['Total_log'] = f2['Total'].transform(np.log)
px.scatter(f2,
           x = 'Total_log',
           y= 'DP',
           color='NE',
           hover_data=['index'],
           title= 'DP vs. Log of Absolute Frequency')
The most informative visualization plots the log of the total token frequency against the exponent of the coefficient of variation. These transformations exaggerate the high-risk values, spreading them out to be more visible.
fig = px.scatter(f2,
x = 'Total_log',
y= 'CV_exp',
color='NE',
opacity=0.6,
hover_data=['index'],
title= 'Transform of CV and absolute frequency')
fig.show()
Notice how most of the country names and language names appear in the sparse area on the upper left. There is also some suggestion of topical imbalance, with words like 'credit', 'card', 'debt', 'repay', and 'betting' appearing in this area as well. So named entities are not necessarily the only tokens that will have to be masked.
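A word-level sketch of what masking would look like; the token set is illustrative, and a real pipeline would apply the replacement at the wordpiece level through the BERT tokenizer.

```python
# illustrative entries drawn from the kinds of tokens flagged above
MASK_VOCAB = {'China', 'Korea', 'Japanese', 'credit', 'betting'}

def mask_tokens(text, vocab=MASK_VOCAB, placeholder='[MASK]'):
    """Replace flagged whitespace-separated words with a placeholder token."""
    return ' '.join(placeholder if word in vocab else word for word in text.split())

masked = mask_tokens('I moved from Korea to study betting markets')
```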
At first, I tried density-based clustering to separate the sparse values at the upper left from the denser values. Although this delivered intuitive results, another algorithm will be used to make a more principled selection of tokens to be masked.
Use sklearn's nearest neighbors algorithm to find a suitable epsilon value for DBSCAN: the elbow in the sorted nearest-neighbor distance plot below suggests a reasonable eps.
from sklearn.neighbors import NearestNeighbors
mdl = f2.loc[:, ['CV_exp', 'Total_log']]
nn = NearestNeighbors(n_neighbors=2)
nbrs=nn.fit(mdl) # fitting the data to the object
distances,indices=nbrs.kneighbors(mdl)
distances = np.sort(distances, axis = 0)
distances = distances[:, 1]
# plotting the distances
px.scatter(distances)
from sklearn.cluster import DBSCAN
# density-based clustering; eps chosen from the k-distance plot above
dbscan = DBSCAN(eps = 0.80, min_samples = 100).fit(mdl) # fitting the model
#dbscan = DBSCAN(eps = 1.3, min_samples = 100).fit(mdl) # fitting the model
labels = dbscan.labels_ # getting the labels
f2['Mask'] = labels
f2.loc[f2['CV_exp']<=5.0, 'Mask'] = 0 # keep the high-frequency, low CV items out of the filter
f2.loc[f2['Total_log']<3.70, 'Mask'] = 0 # get low frequency items out of the filter
f2['Mask'] = pd.Categorical(f2['Mask']) # change from float to categorical type
freq_thresh = 3.8
cv_thresh = 3.7
x_range = np.array(np.arange(freq_thresh + 0.01, max(f2['Total_log']), 0.01))  # start just above freq_thresh to avoid division by zero
b = 0.5
filter_func = 1/(b*(x_range-freq_thresh))+cv_thresh
fig = px.scatter(f2,
x = 'Total_log',
y= 'CV_exp',
color='Mask',
opacity=0.6,
hover_data=['index'],
title= 'Masking ',
)
fig.add_scatter(x = x_range, y = filter_func, mode = 'lines')
fig.update_layout(yaxis_range=[0, max(f2['CV_exp'])+1.0],
xaxis_range = [0, max(f2['Total_log'])+1.0])
#fig.update_traces(legendgrouptitle=dict('text','Filter'))
fig.show()
The points in red would be masked under the DBSCAN approach. This looks intuitive but does not follow any principled logic. I will first clarify a problem with the DBSCAN mask as it stands, and then propose a way to optimize the position and shape of the green line that forms the mask boundary.
f2.style.set_sticky(axis='columns')  #.hide(axis="index")
f2.loc[f2['Mask'] == -1].sort_values(by='Total_log', ascending=False)
| first | index | Target | Total | NE | DP | SD | CV | SD_log | CV_exp | Total_log | Mask | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| second | GE | CN | JP | TR | RU | SP | AR | KO | ||||||||||
| 31 | > | 55 | 3584 | 28 | 35 | 17 | 21 | 50 | 151 | 3941 | O | 4.163336e-17 | 1169.117502 | 2.373240 | 7.064004 | 10.732111 | 8.279190 | -1 |
| 30 | < | 54 | 3573 | 24 | 26 | 18 | 9 | 33 | 97 | 3834 | O | 6.245005e-17 | 1169.612943 | 2.440507 | 7.064428 | 11.478858 | 8.251664 | -1 |
| 46 | R | 136 | 2735 | 44 | 28 | 59 | 57 | 320 | 74 | 3453 | PER | 4.857226e-17 | 875.046133 | 2.027330 | 6.774277 | 7.593781 | 8.146999 | -1 |
| 35 | Hong | 0 | 3355 | 60 | 0 | 0 | 0 | 2 | 5 | 3422 | LOC | 5.551115e-17 | 1106.565266 | 2.586944 | 7.009016 | 13.289097 | 8.137980 | -1 |
| 37 | Kong | 1 | 3316 | 1 | 0 | 0 | 0 | 0 | 0 | 3318 | LOC | 6.938894e-18 | 1096.569509 | 2.643929 | 6.999942 | 14.068368 | 8.107117 | -1 |
| 45 | credit | 37 | 2820 | 17 | 6 | 19 | 11 | 18 | 22 | 2950 | O | 2.775558e-17 | 926.523846 | 2.512607 | 6.831440 | 12.337051 | 7.989560 | -1 |
| 65 | smoking | 12 | 2067 | 29 | 31 | 12 | 22 | 410 | 85 | 2668 | O | 4.163336e-17 | 667.140353 | 2.000421 | 6.503000 | 7.392166 | 7.889084 | -1 |
| 72 | smoke | 18 | 1901 | 69 | 16 | 16 | 42 | 323 | 62 | 2447 | O | 4.163336e-17 | 610.415727 | 1.995638 | 6.414140 | 7.356894 | 7.802618 | -1 |
| 105 | restaurants | 70 | 1497 | 129 | 24 | 49 | 14 | 217 | 202 | 2202 | O | 6.938894e-18 | 467.375050 | 1.698002 | 6.147132 | 5.463021 | 7.697121 | -1 |
| 70 | Japanese | 14 | 83 | 1903 | 2 | 13 | 2 | 36 | 67 | 2120 | MISC | 2.775558e-17 | 619.743495 | 2.338655 | 6.429306 | 10.367280 | 7.659171 | -1 |
| 94 | card | 68 | 1670 | 76 | 12 | 19 | 11 | 77 | 65 | 1998 | O | 5.551115e-17 | 537.480174 | 2.152073 | 6.286892 | 8.602671 | 7.599902 | -1 |
| 95 | Japan | 15 | 79 | 1657 | 9 | 15 | 13 | 35 | 91 | 1914 | LOC | 0.000000e+00 | 536.676287 | 2.243161 | 6.285395 | 9.423071 | 7.556951 | -1 |
| 90 | Korea | 1 | 43 | 69 | 3 | 5 | 3 | 16 | 1660 | 1800 | LOC | 1.110223e-16 | 542.852420 | 2.412677 | 6.296837 | 11.163811 | 7.495542 | -1 |
| 104 | China | 19 | 1491 | 35 | 39 | 9 | 16 | 42 | 66 | 1717 | LOC | 1.387779e-17 | 482.718069 | 2.249123 | 6.179433 | 9.479421 | 7.448334 | -1 |
| 106 | cards | 86 | 1435 | 43 | 18 | 49 | 19 | 9 | 24 | 1683 | O | 4.163336e-17 | 463.431747 | 2.202884 | 6.138659 | 9.051080 | 7.428333 | -1 |
| 135 | waste | 98 | 1224 | 91 | 22 | 53 | 34 | 101 | 54 | 1677 | O | 4.857226e-17 | 384.396585 | 1.833734 | 5.951675 | 6.257210 | 7.424762 | -1 |
| 85 | \ | 0 | 2 | 0 | 2 | 0 | 0 | 1672 | 0 | 1676 | O | 6.938894e-18 | 552.773688 | 2.638538 | 6.314949 | 13.992730 | 7.424165 | -1 |
| 137 | abortion | 3 | 1187 | 16 | 198 | 30 | 51 | 22 | 22 | 1529 | O | 1.387779e-17 | 380.899737 | 1.992935 | 5.942536 | 7.337038 | 7.332369 | -1 |
| 128 | ##ber | 47 | 1257 | 14 | 11 | 21 | 22 | 21 | 54 | 1447 | O | 6.245005e-17 | 406.990613 | 2.250121 | 6.008790 | 9.488883 | 7.277248 | -1 |
| 110 | cafe | 4 | 1383 | 4 | 7 | 24 | 1 | 4 | 3 | 1430 | O | 0.000000e+00 | 455.214167 | 2.546653 | 6.120768 | 12.764306 | 7.265430 | -1 |
| 146 | recycling | 15 | 1105 | 51 | 0 | 8 | 17 | 36 | 23 | 1255 | O | 4.857226e-17 | 358.669569 | 2.286340 | 5.882402 | 9.838860 | 7.134891 | -1 |
| 204 | * | 38 | 787 | 23 | 9 | 37 | 10 | 252 | 77 | 1233 | O | 2.775558e-17 | 250.571565 | 1.625768 | 5.523745 | 5.082323 | 7.117206 | -1 |
| 165 | Chinese | 10 | 1016 | 69 | 18 | 5 | 5 | 41 | 49 | 1213 | MISC | 1.387779e-17 | 327.424013 | 2.159433 | 5.791256 | 8.666222 | 7.100852 | -1 |
| 198 | disadvantage | 43 | 822 | 55 | 45 | 30 | 31 | 106 | 30 | 1162 | O | 6.938894e-18 | 256.855869 | 1.768371 | 5.548515 | 5.861297 | 7.057898 | -1 |
| 141 | Saudi | 4 | 7 | 1 | 0 | 0 | 5 | 1125 | 13 | 1155 | LOC | 1.387779e-17 | 370.663573 | 2.567367 | 5.915295 | 13.031464 | 7.051856 | -1 |
| 155 | professionals | 11 | 1048 | 7 | 1 | 29 | 9 | 12 | 3 | 1120 | O | 2.775558e-17 | 343.283775 | 2.452027 | 5.838557 | 11.611860 | 7.021084 | -1 |
| 163 | Korean | 0 | 44 | 45 | 0 | 0 | 0 | 3 | 1009 | 1101 | MISC | 0.000000e+00 | 329.872300 | 2.396892 | 5.798706 | 10.988973 | 7.003974 | -1 |
| 238 | Germany | 682 | 23 | 37 | 6 | 23 | 10 | 35 | 59 | 875 | LOC | 4.163336e-17 | 216.993627 | 1.983942 | 5.379868 | 7.271348 | 6.774224 | -1 |
| 187 | Arabia | 5 | 7 | 1 | 0 | 0 | 5 | 804 | 6 | 828 | LOC | 4.857226e-17 | 264.776793 | 2.558230 | 5.578887 | 12.912940 | 6.719013 | -1 |
| 200 | banning | 8 | 780 | 1 | 0 | 2 | 0 | 12 | 2 | 805 | O | 2.081668e-17 | 256.810698 | 2.552156 | 5.548339 | 12.834746 | 6.690842 | -1 |
| 240 | import | 26 | 665 | 12 | 5 | 3 | 6 | 15 | 21 | 753 | O | 3.469447e-17 | 215.900693 | 2.293766 | 5.374819 | 9.912194 | 6.624065 | -1 |
| 233 | debt | 2 | 673 | 1 | 1 | 5 | 3 | 7 | 8 | 700 | O | 2.081668e-17 | 221.311997 | 2.529280 | 5.399573 | 12.544470 | 6.551080 | -1 |
| 382 | survey | 58 | 443 | 31 | 4 | 46 | 20 | 15 | 59 | 676 | O | 6.938894e-18 | 136.789071 | 1.618806 | 4.918440 | 5.047058 | 6.516193 | -1 |
| 287 | ##land | 5 | 561 | 19 | 3 | 14 | 11 | 25 | 24 | 662 | O | 2.081668e-17 | 180.919008 | 2.186332 | 5.198049 | 8.902503 | 6.495266 | -1 |
| 292 | Main | 27 | 545 | 11 | 5 | 6 | 7 | 15 | 18 | 634 | MISC | 0.000000e+00 | 176.170904 | 2.222977 | 5.171455 | 9.234779 | 6.452049 | -1 |
| 271 | Taiwan | 2 | 572 | 9 | 0 | 0 | 2 | 2 | 4 | 591 | ORG | 1.387779e-17 | 188.292616 | 2.548800 | 5.237997 | 12.791747 | 6.381816 | -1 |
| 378 | ban | 27 | 432 | 27 | 3 | 14 | 5 | 44 | 2 | 554 | O | 0.000000e+00 | 137.789468 | 1.989740 | 4.925727 | 7.313629 | 6.317165 | -1 |
| 392 | Al | 23 | 20 | 10 | 15 | 20 | 23 | 419 | 9 | 539 | LOC | 2.775558e-17 | 132.999001 | 1.974011 | 4.890342 | 7.199497 | 6.289716 | -1 |
| 429 | ##id | 30 | 23 | 12 | 23 | 13 | 13 | 390 | 30 | 534 | O | 6.938894e-18 | 122.370084 | 1.833260 | 4.807050 | 6.254240 | 6.280396 | -1 |
| 317 | railway | 14 | 497 | 3 | 1 | 7 | 0 | 0 | 0 | 522 | O | 3.469447e-17 | 163.249617 | 2.501910 | 5.095280 | 12.205783 | 6.257668 | -1 |
| 326 | mainland | 1 | 486 | 1 | 0 | 0 | 1 | 0 | 0 | 489 | O | 3.469447e-17 | 160.588323 | 2.627212 | 5.078844 | 13.835141 | 6.192362 | -1 |
| 365 | Russia | 12 | 4 | 2 | 3 | 437 | 3 | 15 | 10 | 486 | LOC | 0.000000e+00 | 142.281192 | 2.342077 | 4.957805 | 10.402823 | 6.186209 | -1 |
| 421 | parks | 14 | 385 | 11 | 0 | 10 | 5 | 23 | 21 | 469 | O | 2.081668e-17 | 123.562674 | 2.107679 | 4.816749 | 8.229118 | 6.150603 | -1 |
| 646 | Government | 6 | 261 | 31 | 7 | 5 | 31 | 31 | 19 | 391 | O | 5.551115e-17 | 80.904940 | 1.655344 | 4.393275 | 5.234881 | 5.968708 | -1 |
| 464 | scheme | 2 | 352 | 4 | 1 | 12 | 4 | 0 | 3 | 378 | O | 6.938894e-18 | 115.235357 | 2.438844 | 4.746977 | 11.459780 | 5.934894 | -1 |
| 432 | Seoul | 0 | 2 | 6 | 0 | 0 | 0 | 0 | 369 | 377 | LOC | 0.000000e+00 | 121.673166 | 2.581924 | 4.801338 | 13.222553 | 5.932245 | -1 |
| 645 | banned | 27 | 259 | 10 | 12 | 5 | 1 | 43 | 17 | 374 | O | 2.081668e-17 | 81.189208 | 1.736668 | 4.396782 | 5.678389 | 5.924256 | -1 |
| 435 | betting | 0 | 367 | 0 | 0 | 0 | 1 | 0 | 0 | 368 | O | 9.020562e-17 | 121.327037 | 2.637544 | 4.798490 | 13.978833 | 5.908083 | -1 |
| 501 | debts | 20 | 319 | 1 | 1 | 11 | 2 | 2 | 6 | 362 | O | 0.000000e+00 | 103.650555 | 2.290620 | 4.641025 | 9.881062 | 5.891644 | -1 |
| 462 | repay | 7 | 351 | 0 | 0 | 1 | 0 | 1 | 1 | 361 | O | 3.469447e-17 | 115.630270 | 2.562444 | 4.750398 | 12.967467 | 5.888878 | -1 |
| 566 | Turkey | 6 | 6 | 5 | 285 | 8 | 3 | 11 | 4 | 328 | LOC | 1.387779e-17 | 92.252371 | 2.250058 | 4.524528 | 9.488285 | 5.793014 | -1 |
| 556 | Russian | 13 | 8 | 1 | 5 | 291 | 3 | 4 | 1 | 326 | MISC | 0.000000e+00 | 94.658267 | 2.322902 | 4.550273 | 10.205250 | 5.786897 | -1 |
| 652 | Cy | 4 | 246 | 3 | 0 | 1 | 0 | 1 | 44 | 299 | O | 6.245005e-17 | 80.081111 | 2.142638 | 4.383040 | 8.521892 | 5.700444 | -1 |
| 727 | ##hand | 17 | 226 | 9 | 3 | 9 | 1 | 28 | 6 | 299 | O | 4.163336e-17 | 71.747713 | 1.919671 | 4.273156 | 6.818717 | 5.700444 | -1 |
| 725 | gambling | 2 | 226 | 12 | 4 | 1 | 28 | 18 | 6 | 297 | O | 2.775558e-17 | 71.901734 | 1.936747 | 4.275300 | 6.936151 | 5.693732 | -1 |
| 732 | Spain | 15 | 1 | 8 | 1 | 9 | 224 | 17 | 19 | 294 | LOC | 5.551115e-17 | 71.057635 | 1.933541 | 4.263491 | 6.913950 | 5.683580 | -1 |
| 692 | labour | 13 | 235 | 1 | 7 | 24 | 13 | 0 | 0 | 293 | O | 3.469447e-17 | 75.380597 | 2.058173 | 4.322550 | 7.831651 | 5.680173 | -1 |
| 900 | breathing | 16 | 190 | 27 | 9 | 8 | 3 | 14 | 14 | 281 | O | 2.775558e-17 | 58.907634 | 1.677086 | 4.075971 | 5.349942 | 5.638355 | -1 |
| 797 | respiratory | 1 | 201 | 2 | 4 | 0 | 0 | 61 | 3 | 272 | O | 4.857226e-17 | 66.053009 | 1.942736 | 4.190458 | 6.977813 | 5.605802 | -1 |
| 893 | pregnancy | 1 | 189 | 13 | 27 | 4 | 6 | 19 | 12 | 271 | O | 1.387779e-17 | 59.157496 | 1.746347 | 4.080203 | 5.733618 | 5.602119 | -1 |
| 715 | bars | 12 | 226 | 6 | 1 | 3 | 3 | 13 | 7 | 271 | O | 3.469447e-17 | 72.726263 | 2.146901 | 4.286703 | 8.558293 | 5.602119 | -1 |
| 922 | MP | 1 | 169 | 2 | 0 | 2 | 0 | 8 | 83 | 265 | O | 0.000000e+00 | 57.819628 | 1.745498 | 4.057328 | 5.728755 | 5.579730 | -1 |
| 833 | affairs | 11 | 199 | 6 | 8 | 24 | 6 | 2 | 7 | 263 | O | 2.775558e-17 | 63.088108 | 1.919030 | 4.144532 | 6.814345 | 5.572154 | -1 |
| 631 | catering | 2 | 253 | 0 | 0 | 3 | 0 | 3 | 0 | 261 | O | 6.938894e-18 | 83.303568 | 2.553366 | 4.422491 | 12.850286 | 5.564520 | -1 |
| 1018 | lung | 5 | 161 | 10 | 5 | 0 | 3 | 57 | 8 | 249 | O | 2.775558e-17 | 52.013069 | 1.671103 | 3.951495 | 5.318028 | 5.517453 | -1 |
| 825 | cheating | 3 | 21 | 1 | 199 | 3 | 7 | 7 | 7 | 248 | O | 6.938894e-18 | 63.757353 | 2.056689 | 4.155085 | 7.820033 | 5.513429 | -1 |
| 701 | cellular | 0 | 3 | 226 | 0 | 1 | 1 | 2 | 9 | 242 | O | 2.775558e-17 | 74.036731 | 2.447495 | 4.304561 | 11.559357 | 5.488938 | -1 |
| 831 | graduates | 2 | 197 | 4 | 13 | 5 | 10 | 4 | 6 | 241 | O | 2.081668e-17 | 63.161376 | 2.096643 | 4.145693 | 8.138803 | 5.484797 | -1 |
| 994 | residents | 23 | 169 | 6 | 3 | 9 | 0 | 9 | 18 | 237 | O | 1.387779e-17 | 53.150582 | 1.794112 | 3.973129 | 6.014135 | 5.468060 | -1 |
| 1049 | shortage | 1 | 162 | 22 | 7 | 11 | 7 | 15 | 10 | 235 | O | 3.469447e-17 | 50.460226 | 1.717795 | 3.921185 | 5.572228 | 5.459586 | -1 |
| 703 | Muslims | 3 | 0 | 0 | 7 | 0 | 0 | 225 | 0 | 235 | MISC | 6.938894e-18 | 73.976242 | 2.518340 | 4.303744 | 12.407984 | 5.459586 | -1 |
| 1054 | conducted | 17 | 161 | 11 | 6 | 19 | 6 | 5 | 7 | 232 | O | 2.081668e-17 | 50.137311 | 1.728873 | 3.914765 | 5.634299 | 5.446737 | -1 |
| 908 | @ | 29 | 4 | 182 | 4 | 6 | 3 | 1 | 0 | 229 | O | 6.938894e-18 | 58.617270 | 2.047765 | 4.071029 | 7.750558 | 5.433722 | -1 |
| 696 | ##iya | 0 | 0 | 0 | 0 | 0 | 1 | 226 | 0 | 227 | O | 6.938894e-18 | 74.695946 | 2.632456 | 4.313426 | 13.907889 | 5.424950 | -1 |
| 699 | ##dh | 0 | 0 | 0 | 0 | 0 | 0 | 225 | 0 | 225 | O | 6.938894e-18 | 74.411756 | 2.645751 | 4.309614 | 14.094030 | 5.416100 | -1 |
| 912 | Arabic | 1 | 3 | 9 | 2 | 5 | 8 | 182 | 13 | 223 | MISC | 3.469447e-17 | 58.374732 | 2.094161 | 4.066883 | 8.118625 | 5.407172 | -1 |
| 972 | raw | 5 | 172 | 14 | 2 | 7 | 4 | 5 | 13 | 222 | O | 0.000000e+00 | 54.666603 | 1.969968 | 4.001253 | 7.170445 | 5.402677 | -1 |
| 916 | Kim | 2 | 5 | 25 | 0 | 4 | 0 | 3 | 180 | 219 | PER | 0.000000e+00 | 58.184915 | 2.125476 | 4.063626 | 8.376887 | 5.389072 | -1 |
| 856 | ##fill | 14 | 190 | 6 | 0 | 2 | 0 | 2 | 1 | 215 | O | 2.775558e-17 | 61.809056 | 2.299872 | 4.124050 | 9.972904 | 5.370638 | -1 |
| 979 | employers | 13 | 169 | 4 | 4 | 6 | 5 | 9 | 3 | 213 | O | 4.163336e-17 | 53.900226 | 2.024422 | 3.987135 | 7.571730 | 5.361292 | -1 |
| 1028 | EU | 162 | 0 | 10 | 4 | 12 | 19 | 2 | 1 | 210 | ORG | 2.775558e-17 | 51.669019 | 1.968344 | 3.944858 | 7.158809 | 5.347108 | -1 |
| 1126 | link | 11 | 150 | 9 | 3 | 6 | 9 | 13 | 3 | 204 | O | 4.857226e-17 | 47.175205 | 1.850008 | 3.853868 | 6.359871 | 5.318120 | -1 |
| 1186 | junior | 6 | 28 | 142 | 3 | 5 | 1 | 1 | 17 | 203 | O | 0.000000e+00 | 44.941455 | 1.771092 | 3.805361 | 5.877267 | 5.313206 | -1 |
| 1137 | Tokyo | 0 | 52 | 139 | 0 | 0 | 0 | 1 | 2 | 194 | LOC | 4.163336e-17 | 46.536948 | 1.919049 | 3.840247 | 6.814477 | 5.267858 | -1 |
| 1003 | ##smo | 1 | 162 | 1 | 0 | 3 | 0 | 21 | 4 | 192 | O | 4.163336e-17 | 52.564246 | 2.190177 | 3.962036 | 8.936794 | 5.257495 | -1 |
| 1269 | ##ah | 10 | 18 | 1 | 7 | 1 | 3 | 133 | 17 | 190 | O | 2.081668e-17 | 41.757484 | 1.758210 | 3.731879 | 5.802042 | 5.247024 | -1 |
| 1234 | crops | 1 | 21 | 136 | 2 | 3 | 2 | 8 | 16 | 189 | O | 2.081668e-17 | 43.025973 | 1.821205 | 3.761804 | 6.179301 | 5.241747 | -1 |
| 909 | Koreans | 0 | 5 | 2 | 0 | 0 | 0 | 0 | 178 | 185 | MISC | 5.551115e-17 | 58.560732 | 2.532356 | 4.070064 | 12.583117 | 5.220356 | -1 |
| 967 | Low | 1 | 168 | 0 | 0 | 1 | 1 | 10 | 4 | 185 | O | 6.245005e-17 | 54.846234 | 2.371729 | 4.004534 | 10.715904 | 5.220356 | -1 |
| 1192 | Rama | 1 | 1 | 0 | 47 | 0 | 0 | 134 | 0 | 183 | O | 6.938894e-18 | 44.694624 | 1.953863 | 3.799853 | 7.055894 | 5.209486 | -1 |
| 920 | Moscow | 2 | 1 | 0 | 0 | 176 | 0 | 2 | 0 | 181 | LOC | 2.081668e-17 | 57.976154 | 2.562482 | 4.060032 | 12.967963 | 5.198497 | -1 |
| 1188 | din | 10 | 141 | 3 | 0 | 6 | 7 | 10 | 3 | 180 | O | 3.469447e-17 | 44.908240 | 1.995922 | 3.804621 | 7.358983 | 5.192957 | -1 |
| 1389 | Happiness | 0 | 45 | 1 | 4 | 1 | 8 | 117 | 4 | 180 | O | 2.081668e-17 | 38.343839 | 1.704171 | 3.646594 | 5.496825 | 5.192957 | -1 |
| 1305 | feminist | 3 | 1 | 0 | 6 | 123 | 45 | 0 | 0 | 178 | O | 3.469447e-17 | 40.680923 | 1.828356 | 3.705759 | 6.223647 | 5.181784 | -1 |
| 1410 | Boston | 2 | 115 | 39 | 0 | 0 | 0 | 2 | 18 | 176 | O | 5.551115e-17 | 37.426595 | 1.701209 | 3.622382 | 5.480569 | 5.170484 | -1 |
| 1251 | ##ncies | 14 | 133 | 2 | 3 | 6 | 6 | 5 | 0 | 169 | O | 6.938894e-18 | 42.463035 | 2.010085 | 3.748634 | 7.463948 | 5.129899 | -1 |
| 1307 | ##dan | 3 | 2 | 4 | 27 | 3 | 0 | 126 | 3 | 168 | O | 1.387779e-17 | 40.503086 | 1.928718 | 3.701378 | 6.880686 | 5.123964 | -1 |
| 1432 | Euro | 117 | 10 | 13 | 1 | 7 | 7 | 2 | 3 | 160 | O | 4.163336e-17 | 36.861226 | 1.843061 | 3.607160 | 6.315843 | 5.075174 | -1 |
| 1245 | adverse | 2 | 132 | 5 | 2 | 2 | 0 | 11 | 2 | 156 | O | 1.387779e-17 | 42.638011 | 2.186565 | 3.752746 | 8.904570 | 5.049856 | -1 |
| 1115 | Credit | 6 | 145 | 1 | 0 | 1 | 1 | 0 | 1 | 155 | O | 5.551115e-17 | 47.515622 | 2.452419 | 3.861059 | 11.616415 | 5.043425 | -1 |
| 1478 | boost | 15 | 113 | 4 | 2 | 8 | 2 | 7 | 4 | 155 | O | 1.387779e-17 | 35.608768 | 1.837872 | 3.572592 | 6.283153 | 5.043425 | -1 |
| 1524 | ##backs | 5 | 110 | 3 | 5 | 10 | 10 | 6 | 2 | 151 | O | 1.387779e-17 | 34.548652 | 1.830392 | 3.542369 | 6.236332 | 5.017280 | -1 |
| 1425 | unwanted | 0 | 116 | 1 | 16 | 3 | 3 | 11 | 1 | 151 | O | 5.551115e-17 | 37.085838 | 1.964813 | 3.613235 | 7.133576 | 5.017280 | -1 |
| 1218 | Professional | 1 | 134 | 0 | 2 | 4 | 2 | 3 | 0 | 146 | O | 4.857226e-17 | 43.768567 | 2.398278 | 3.778916 | 11.004207 | 4.983607 | -1 |
| 1488 | cheat | 3 | 12 | 2 | 111 | 3 | 6 | 4 | 4 | 145 | O | 6.938894e-18 | 35.225834 | 1.943494 | 3.561780 | 6.983109 | 4.976734 | -1 |
| 1518 | ##gna | 3 | 109 | 1 | 5 | 11 | 7 | 6 | 0 | 142 | O | 6.938894e-18 | 34.643722 | 1.951759 | 3.545117 | 7.041062 | 4.955827 | -1 |
| 1275 | Muslim | 3 | 0 | 1 | 2 | 0 | 4 | 127 | 0 | 137 | MISC | 6.938894e-18 | 41.552489 | 2.426423 | 3.726957 | 11.318321 | 4.919981 | -1 |
| 1452 | [UNK] | 0 | 0 | 0 | 25 | 1 | 0 | 111 | 0 | 137 | O | 6.938894e-18 | 36.402052 | 2.125667 | 3.594625 | 8.378486 | 4.919981 | -1 |
| 1423 | Ban | 1 | 115 | 4 | 1 | 0 | 3 | 10 | 2 | 136 | PER | 4.163336e-17 | 37.155080 | 2.185593 | 3.615101 | 8.895922 | 4.912655 | -1 |
| 1375 | constructing | 4 | 119 | 2 | 0 | 7 | 3 | 0 | 1 | 136 | O | 2.081668e-17 | 38.613469 | 2.271381 | 3.653601 | 9.692773 | 4.912655 | -1 |
| 1673 | SA | 3 | 97 | 9 | 0 | 0 | 0 | 22 | 4 | 135 | O | 4.163336e-17 | 31.066210 | 1.840961 | 3.436121 | 6.302589 | 4.905275 | -1 |
| 1664 | – | 0 | 10 | 8 | 2 | 1 | 5 | 10 | 99 | 135 | O | 5.551115e-17 | 31.258749 | 1.852370 | 3.442299 | 6.374912 | 4.905275 | -1 |
| 1420 | prospects | 3 | 115 | 5 | 1 | 8 | 1 | 1 | 1 | 135 | O | 2.775558e-17 | 37.163280 | 2.202268 | 3.615321 | 9.045509 | 4.905275 | -1 |
| 1687 | toxic | 7 | 98 | 7 | 1 | 9 | 2 | 7 | 4 | 135 | O | 1.387779e-17 | 30.771080 | 1.823471 | 3.426575 | 6.193321 | 4.905275 | -1 |
| 1284 | Taiwanese | 0 | 126 | 1 | 0 | 0 | 0 | 0 | 6 | 133 | MISC | 6.938894e-18 | 41.385195 | 2.489335 | 3.722923 | 12.053259 | 4.890349 | -1 |
| 1678 | feminism | 0 | 1 | 2 | 2 | 93 | 34 | 0 | 0 | 132 | O | 6.938894e-18 | 30.894983 | 1.872423 | 3.430594 | 6.504038 | 4.882802 | -1 |
| 1445 | Turkish | 14 | 0 | 0 | 112 | 0 | 0 | 4 | 0 | 130 | MISC | 6.245005e-17 | 36.475163 | 2.244625 | 3.596632 | 9.436880 | 4.867534 | -1 |
| 1300 | GM | 0 | 4 | 124 | 0 | 0 | 0 | 1 | 0 | 129 | ORG | 6.245005e-17 | 40.793497 | 2.529829 | 3.708523 | 12.551363 | 4.859812 | -1 |
| 1435 | imported | 4 | 113 | 4 | 0 | 0 | 1 | 4 | 2 | 128 | O | 6.245005e-17 | 36.698093 | 2.293631 | 3.602725 | 9.910857 | 4.852030 | -1 |
| 1306 | Shen | 0 | 123 | 0 | 2 | 0 | 0 | 0 | 1 | 126 | O | 2.081668e-17 | 40.542416 | 2.574122 | 3.702349 | 13.119788 | 4.836282 | -1 |
| 1387 | ##turn | 2 | 117 | 0 | 0 | 6 | 0 | 0 | 0 | 125 | O | 9.020562e-17 | 38.366449 | 2.455453 | 3.647183 | 11.651707 | 4.828314 | -1 |
| 1475 | Berlin | 110 | 2 | 1 | 1 | 5 | 2 | 1 | 2 | 124 | LOC | 1.387779e-17 | 35.738635 | 2.305718 | 3.576232 | 10.031382 | 4.820282 | -1 |
| 1809 | ##crow | 11 | 91 | 1 | 2 | 8 | 4 | 3 | 2 | 122 | O | 2.775558e-17 | 28.808636 | 1.889091 | 3.360675 | 6.613354 | 4.804021 | -1 |
| 1361 | ##dah | 1 | 0 | 0 | 0 | 0 | 0 | 118 | 0 | 119 | O | 6.245005e-17 | 38.978961 | 2.620434 | 3.663022 | 13.741691 | 4.779123 | -1 |
| 1525 | Valley | 2 | 106 | 0 | 0 | 0 | 7 | 4 | 0 | 119 | O | 6.245005e-17 | 34.523316 | 2.320895 | 3.541635 | 10.184788 | 4.779123 | -1 |
| 1817 | personnel | 6 | 90 | 6 | 0 | 13 | 1 | 2 | 1 | 119 | O | 4.857226e-17 | 28.672450 | 1.927560 | 3.355937 | 6.872718 | 4.779123 | -1 |
| 1588 | Kuwait | 0 | 1 | 0 | 0 | 14 | 3 | 101 | 0 | 119 | LOC | 6.245005e-17 | 32.857410 | 2.208901 | 3.492177 | 9.105708 | 4.779123 | -1 |
| 1408 | ##amo | 3 | 1 | 0 | 0 | 0 | 0 | 114 | 1 | 119 | O | 2.081668e-17 | 37.478119 | 2.519537 | 3.623757 | 12.422848 | 4.779123 | -1 |
| 1631 | losses | 4 | 99 | 6 | 0 | 3 | 2 | 2 | 1 | 117 | O | 2.081668e-17 | 31.937194 | 2.183740 | 3.463771 | 8.879451 | 4.762174 | -1 |
| 1629 | ##tus | 0 | 98 | 1 | 0 | 0 | 1 | 15 | 0 | 115 | O | 2.081668e-17 | 31.972400 | 2.224167 | 3.464873 | 9.245778 | 4.744932 | -1 |
| 1813 | ##mpo | 3 | 90 | 2 | 4 | 1 | 4 | 6 | 3 | 113 | O | 1.387779e-17 | 28.711659 | 2.032684 | 3.357303 | 7.634549 | 4.727388 | -1 |
| 1471 | bartender | 0 | 109 | 0 | 0 | 0 | 0 | 3 | 1 | 113 | O | 2.775558e-17 | 35.872822 | 2.539669 | 3.579980 | 12.675473 | 4.727388 | -1 |
| 1569 | Libya | 0 | 2 | 0 | 0 | 0 | 1 | 102 | 7 | 112 | LOC | 6.938894e-18 | 33.335417 | 2.381101 | 3.506620 | 10.816808 | 4.718499 | -1 |
| 1465 | Jed | 0 | 0 | 0 | 0 | 0 | 1 | 109 | 0 | 110 | O | 6.245005e-17 | 36.002604 | 2.618371 | 3.583591 | 13.713369 | 4.700480 | -1 |
| 1896 | Islam | 1 | 2 | 1 | 16 | 1 | 2 | 85 | 0 | 108 | MISC | 6.245005e-17 | 27.463612 | 2.034342 | 3.312862 | 7.647216 | 4.682131 | -1 |
| 1740 | Eva | 1 | 92 | 0 | 0 | 2 | 0 | 3 | 8 | 106 | O | 6.938894e-17 | 29.869508 | 2.254302 | 3.396838 | 9.528645 | 4.663439 | -1 |
| 1670 | ##yang | 1 | 95 | 0 | 0 | 0 | 0 | 0 | 7 | 103 | O | 7.632783e-17 | 31.122490 | 2.417281 | 3.437931 | 11.215321 | 4.634729 | -1 |
| 2084 | crossing | 9 | 78 | 1 | 1 | 4 | 2 | 5 | 2 | 102 | O | 4.163336e-17 | 24.787850 | 1.944145 | 3.210354 | 6.987655 | 4.624973 | -1 |
| 2006 | holy | 9 | 0 | 0 | 0 | 3 | 2 | 80 | 8 | 102 | O | 3.469447e-17 | 25.635669 | 2.010641 | 3.243985 | 7.468100 | 4.624973 | -1 |
| 1935 | ##ibly | 4 | 83 | 3 | 0 | 3 | 4 | 1 | 0 | 98 | O | 6.938894e-18 | 26.785024 | 2.186533 | 3.287843 | 8.904284 | 4.584967 | -1 |
| 1954 | investors | 7 | 82 | 0 | 0 | 2 | 5 | 2 | 0 | 98 | O | 4.857226e-17 | 26.470502 | 2.160857 | 3.276031 | 8.678575 | 4.584967 | -1 |
| 1628 | Munich | 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 97 | LOC | 6.245005e-17 | 32.079735 | 2.645751 | 3.468225 | 14.094030 | 4.574711 | -1 |
| 1989 | prayer | 2 | 3 | 2 | 0 | 2 | 4 | 80 | 1 | 94 | O | 2.775558e-17 | 25.820292 | 2.197472 | 3.251161 | 9.002224 | 4.543295 | -1 |
| 2012 | Mao | 0 | 16 | 78 | 0 | 0 | 0 | 0 | 0 | 94 | PER | 6.938894e-18 | 25.581976 | 2.177189 | 3.241888 | 8.821478 | 4.543295 | -1 |
| 1944 | prohibition | 0 | 82 | 0 | 4 | 6 | 1 | 0 | 0 | 93 | O | 3.469447e-17 | 26.683035 | 2.295315 | 3.284028 | 9.927561 | 4.532599 | -1 |
| 1689 | Scheme | 0 | 93 | 0 | 0 | 0 | 0 | 0 | 0 | 93 | O | 6.245005e-17 | 30.756859 | 2.645751 | 3.426113 | 14.094030 | 4.532599 | -1 |
| 1964 | foe | 0 | 81 | 2 | 0 | 4 | 0 | 5 | 0 | 92 | O | 4.857226e-17 | 26.334388 | 2.289947 | 3.270876 | 9.874412 | 4.521789 | -1 |
| 2045 | Arab | 3 | 1 | 5 | 1 | 2 | 2 | 78 | 0 | 92 | MISC | 6.938894e-18 | 25.174392 | 2.189078 | 3.225827 | 8.926975 | 4.521789 | -1 |
| 1946 | ##rting | 0 | 82 | 2 | 0 | 3 | 2 | 3 | 0 | 92 | O | 2.081668e-17 | 26.673957 | 2.319475 | 3.283688 | 10.170329 | 4.521789 | -1 |
| 1792 | Taipei | 0 | 88 | 0 | 0 | 0 | 0 | 0 | 2 | 90 | LOC | 4.857226e-17 | 29.016159 | 2.579214 | 3.367853 | 13.186771 | 4.499810 | -1 |
| 1893 | Islamic | 0 | 0 | 1 | 3 | 1 | 1 | 84 | 0 | 90 | O | 6.938894e-18 | 27.512497 | 2.445555 | 3.314640 | 11.536954 | 4.499810 | -1 |
| 2101 | Spring | 0 | 75 | 1 | 0 | 0 | 0 | 12 | 0 | 88 | LOC | 6.245005e-17 | 24.500000 | 2.227273 | 3.198673 | 9.274537 | 4.477337 | -1 |
| 2047 | Beijing | 0 | 77 | 1 | 0 | 0 | 0 | 0 | 9 | 87 | LOC | 6.938894e-17 | 25.161665 | 2.313716 | 3.225322 | 10.111934 | 4.465908 | -1 |
| 2107 | Germans | 75 | 0 | 0 | 1 | 2 | 0 | 3 | 3 | 84 | MISC | 4.163336e-17 | 24.407990 | 2.324571 | 3.194911 | 10.222289 | 4.430817 | -1 |
| 2261 | immunity | 0 | 2 | 4 | 1 | 0 | 1 | 70 | 6 | 84 | O | 3.469447e-17 | 22.572107 | 2.149724 | 3.116715 | 8.582493 | 4.430817 | -1 |
| 2097 | terminate | 1 | 75 | 0 | 0 | 0 | 0 | 7 | 0 | 83 | O | 3.469447e-17 | 24.530275 | 2.364364 | 3.199908 | 10.637269 | 4.418841 | -1 |
| 2136 | Hamburg | 74 | 0 | 4 | 1 | 0 | 0 | 0 | 4 | 83 | LOC | 4.163336e-17 | 24.103617 | 2.323240 | 3.182362 | 10.208699 | 4.418841 | -1 |
| 1966 | Colombia | 1 | 0 | 2 | 0 | 0 | 80 | 0 | 0 | 83 | LOC | 6.938894e-18 | 26.324596 | 2.537310 | 3.270504 | 12.645615 | 4.418841 | -1 |
| 1986 | tavern | 0 | 79 | 0 | 1 | 0 | 0 | 0 | 2 | 82 | O | 2.775558e-17 | 25.993990 | 2.535999 | 3.257865 | 12.629041 | 4.406719 | -1 |
| 1917 | HK | 0 | 82 | 0 | 0 | 0 | 0 | 0 | 0 | 82 | LOC | 6.245005e-17 | 27.118951 | 2.645751 | 3.300233 | 14.094030 | 4.406719 | -1 |
| 2072 | sorting | 1 | 76 | 0 | 0 | 1 | 0 | 3 | 0 | 81 | O | 6.938894e-18 | 24.917050 | 2.460943 | 3.215552 | 11.715857 | 4.394449 | -1 |
| 2161 | ##erman | 73 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 80 | O | 2.775558e-17 | 23.822258 | 2.382226 | 3.170620 | 10.828980 | 4.382027 | -1 |
| 2036 | Istanbul | 2 | 0 | 0 | 77 | 0 | 0 | 1 | 0 | 80 | LOC | 3.469447e-17 | 25.332785 | 2.533279 | 3.232099 | 12.594730 | 4.382027 | -1 |
| 1977 | Ain | 0 | 0 | 79 | 0 | 0 | 0 | 0 | 0 | 79 | O | 6.938894e-18 | 26.126794 | 2.645751 | 3.262961 | 14.094030 | 4.369448 | -1 |
| 2440 | Arabian | 3 | 1 | 2 | 3 | 0 | 2 | 64 | 3 | 78 | MISC | 3.469447e-17 | 20.528943 | 2.105533 | 3.021836 | 8.211475 | 4.356709 | -1 |
| 2488 | monsters | 5 | 2 | 62 | 0 | 6 | 0 | 2 | 0 | 77 | O | 3.469447e-17 | 19.911915 | 2.068770 | 2.991318 | 7.915085 | 4.343805 | -1 |
| 2474 | neighbour | 62 | 2 | 0 | 0 | 9 | 3 | 0 | 0 | 76 | O | 3.469447e-17 | 20.049938 | 2.110520 | 2.998226 | 8.252529 | 4.330733 | -1 |
| 2257 | ##mission | 1 | 69 | 0 | 0 | 0 | 1 | 1 | 1 | 73 | O | 3.469447e-17 | 22.635357 | 2.480587 | 3.119513 | 11.948276 | 4.290459 | -1 |
| 2506 | ##PM | 0 | 61 | 2 | 0 | 0 | 0 | 6 | 3 | 72 | O | 6.938894e-18 | 19.754746 | 2.194972 | 2.983394 | 8.979748 | 4.276666 | -1 |
| 2627 | ##zone | 58 | 2 | 1 | 0 | 1 | 9 | 1 | 0 | 72 | O | 3.469447e-17 | 18.721645 | 2.080183 | 2.929680 | 8.005932 | 4.276666 | -1 |
| 2251 | Wu | 2 | 69 | 0 | 0 | 0 | 0 | 0 | 1 | 72 | PER | 8.326673e-17 | 22.688103 | 2.520900 | 3.121841 | 12.439791 | 4.276666 | -1 |
| 2500 | mosque | 0 | 1 | 0 | 7 | 1 | 1 | 61 | 0 | 71 | O | 6.938894e-18 | 19.820680 | 2.233316 | 2.986726 | 9.330757 | 4.262680 | -1 |
| 2545 | Constitution | 1 | 60 | 0 | 0 | 5 | 4 | 0 | 1 | 71 | O | 4.163336e-17 | 19.406426 | 2.186640 | 2.965604 | 8.905237 | 4.262680 | -1 |
| 2756 | therapist | 1 | 7 | 1 | 1 | 1 | 2 | 55 | 2 | 70 | O | 5.551115e-17 | 17.583728 | 2.009569 | 2.866974 | 7.460101 | 4.248495 | -1 |
| 2544 | Grandpa | 2 | 60 | 0 | 0 | 0 | 3 | 4 | 1 | 70 | PER | 4.857226e-17 | 19.421316 | 2.219579 | 2.966371 | 9.203455 | 4.248495 | -1 |
| 2395 | Peak | 0 | 64 | 1 | 0 | 0 | 0 | 3 | 1 | 69 | O | 4.163336e-17 | 20.951954 | 2.429212 | 3.042232 | 11.349935 | 4.234107 | -1 |
| 2753 | Cat | 3 | 55 | 2 | 2 | 0 | 2 | 2 | 2 | 68 | O | 2.775558e-17 | 17.592612 | 2.069719 | 2.867479 | 7.922597 | 4.219508 | -1 |
| 2302 | Lok | 0 | 67 | 0 | 0 | 0 | 0 | 1 | 0 | 68 | ORG | 9.020562e-17 | 22.113344 | 2.601570 | 3.096181 | 13.484892 | 4.219508 | -1 |
| 2817 | sphere | 7 | 1 | 0 | 0 | 53 | 2 | 3 | 0 | 66 | O | 1.387779e-17 | 17.056890 | 2.067502 | 2.836554 | 7.905050 | 4.189655 | -1 |
| 2524 | Cafe | 0 | 60 | 0 | 2 | 0 | 1 | 1 | 2 | 66 | O | 2.081668e-17 | 19.575176 | 2.372749 | 2.974262 | 10.726835 | 4.189655 | -1 |
| 2738 | urine | 0 | 55 | 1 | 1 | 0 | 1 | 6 | 2 | 66 | O | 6.245005e-17 | 17.760560 | 2.152795 | 2.876980 | 8.608888 | 4.189655 | -1 |
| 2875 | Soviet | 2 | 2 | 5 | 0 | 52 | 0 | 0 | 5 | 66 | MISC | 1.387779e-17 | 16.648949 | 2.018054 | 2.812347 | 7.523673 | 4.189655 | -1 |
| 2614 | Consul | 2 | 58 | 2 | 0 | 1 | 0 | 1 | 1 | 65 | O | 9.020562e-17 | 18.864235 | 2.321752 | 2.937268 | 10.193518 | 4.174387 | -1 |
| 2644 | Library | 0 | 57 | 0 | 1 | 1 | 0 | 2 | 3 | 64 | ORG | 6.245005e-17 | 18.547237 | 2.318405 | 2.920321 | 10.159453 | 4.158883 | -1 |
| 2640 | unification | 3 | 0 | 1 | 0 | 3 | 0 | 0 | 57 | 64 | O | 0.000000e+00 | 18.560711 | 2.320089 | 2.921047 | 10.176579 | 4.158883 | -1 |
| 2442 | Augsburg | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 62 | LOC | 6.245005e-17 | 20.504573 | 2.645751 | 3.020648 | 14.094030 | 4.127134 | -1 |
| 2933 | Shanghai | 9 | 50 | 0 | 0 | 2 | 0 | 0 | 0 | 61 | LOC | 3.469447e-17 | 16.278341 | 2.134864 | 2.789835 | 8.455899 | 4.110874 | -1 |
| 2722 | ##ogen | 0 | 55 | 2 | 0 | 0 | 0 | 3 | 1 | 61 | O | 2.775558e-17 | 17.936956 | 2.352388 | 2.886863 | 10.510635 | 4.110874 | -1 |
| 3082 | Industry | 3 | 48 | 2 | 0 | 3 | 1 | 3 | 0 | 60 | O | 4.163336e-17 | 15.354153 | 2.047220 | 2.731386 | 7.746339 | 4.094345 | -1 |
| 2625 | Venezuela | 0 | 0 | 0 | 0 | 3 | 57 | 0 | 0 | 60 | LOC | 6.938894e-18 | 18.734994 | 2.497999 | 2.930393 | 12.158144 | 4.094345 | -1 |
| 3076 | ##wl | 5 | 48 | 0 | 0 | 2 | 1 | 3 | 1 | 60 | O | 4.163336e-17 | 15.386683 | 2.051558 | 2.733502 | 7.780010 | 4.094345 | -1 |
| 2792 | ion | 1 | 53 | 1 | 0 | 0 | 0 | 4 | 1 | 60 | O | 1.387779e-17 | 17.240940 | 2.298792 | 2.847287 | 9.962140 | 4.094345 | -1 |
| 2889 | Tommy | 1 | 51 | 0 | 0 | 0 | 2 | 5 | 0 | 59 | PER | 4.857226e-17 | 16.567570 | 2.246450 | 2.807447 | 9.454116 | 4.077537 | -1 |
| 2534 | Frankfurt | 59 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 59 | LOC | 6.245005e-17 | 19.512416 | 2.645751 | 2.971051 | 14.094030 | 4.077537 | -1 |
| 2993 | Line | 0 | 49 | 5 | 0 | 1 | 1 | 0 | 1 | 57 | O | 3.469447e-17 | 15.901553 | 2.231797 | 2.766417 | 9.316592 | 4.043051 | -1 |
| 2692 | cycling | 55 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 56 | O | 3.469447e-17 | 18.145247 | 2.592178 | 2.898409 | 13.358838 | 4.025352 | -1 |
| 2774 | ##won | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 53 | 55 | O | 0.000000e+00 | 17.438732 | 2.536543 | 2.858694 | 12.635911 | 4.007333 | -1 |
| 3293 | sergeant | 1 | 0 | 5 | 0 | 2 | 2 | 0 | 44 | 54 | O | 0.000000e+00 | 14.166422 | 2.098729 | 2.650874 | 8.155798 | 3.988984 | -1 |
| 3193 | Cheung | 0 | 45 | 0 | 0 | 0 | 0 | 8 | 0 | 53 | PER | 7.632783e-17 | 14.738873 | 2.224736 | 2.690488 | 9.251036 | 3.970292 | -1 |
| 3090 | brushes | 1 | 47 | 0 | 0 | 2 | 0 | 0 | 3 | 53 | O | 6.938894e-17 | 15.296548 | 2.308913 | 2.727627 | 10.063478 | 3.970292 | -1 |
| 2763 | ý | 0 | 0 | 0 | 53 | 0 | 0 | 0 | 0 | 53 | O | 6.938894e-18 | 17.528102 | 2.645751 | 2.863805 | 14.094030 | 3.970292 | -1 |
| 2962 | Grandma | 3 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 52 | PER | 6.245005e-17 | 16.093477 | 2.475920 | 2.778414 | 11.892638 | 3.951244 | -1 |
| 3070 | broadband | 0 | 47 | 0 | 0 | 0 | 1 | 1 | 1 | 50 | O | 4.163336e-17 | 15.409007 | 2.465441 | 2.734952 | 11.768673 | 3.912023 | -1 |
| 2897 | Libyan | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 50 | MISC | 6.938894e-18 | 16.535946 | 2.645751 | 2.805537 | 14.094030 | 3.912023 | -1 |
| 3537 | ##IP | 1 | 1 | 6 | 0 | 0 | 2 | 40 | 0 | 50 | O | 6.245005e-17 | 12.891373 | 2.062620 | 2.556558 | 7.866551 | 3.912023 | -1 |
| 3266 | Nam | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 44 | 50 | O | 5.551115e-17 | 14.298164 | 2.287706 | 2.660131 | 9.852313 | 3.912023 | -1 |
| 3260 | ##á | 0 | 0 | 0 | 0 | 3 | 0 | 44 | 3 | 50 | O | 2.081668e-17 | 14.324367 | 2.291899 | 2.661962 | 9.893706 | 3.912023 | -1 |
| 2895 | Allah | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 50 | PER | 6.938894e-18 | 16.535946 | 2.645751 | 2.805537 | 14.094030 | 3.912023 | -1 |
| 3541 | Temple | 1 | 40 | 4 | 0 | 3 | 0 | 1 | 0 | 49 | LOC | 6.938894e-18 | 12.878640 | 2.102635 | 2.555570 | 8.187717 | 3.891820 | -1 |
| 3121 | Chairman | 0 | 46 | 0 | 0 | 0 | 0 | 2 | 0 | 48 | O | 7.632783e-17 | 15.132746 | 2.522124 | 2.716861 | 12.455027 | 3.871201 | -1 |
| 3452 | ##world | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 41 | 47 | O | 5.551115e-17 | 13.298849 | 2.263634 | 2.587677 | 9.617976 | 3.850148 | -1 |
| 3117 | Babe | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 46 | 47 | O | 0.000000e+00 | 15.169356 | 2.582018 | 2.719277 | 13.223798 | 3.850148 | -1 |
| 3695 | Hans | 3 | 2 | 38 | 0 | 1 | 1 | 1 | 1 | 47 | O | 6.938894e-18 | 12.170020 | 2.071493 | 2.498976 | 7.936661 | 3.850148 | -1 |
| 3301 | Lebanese | 0 | 3 | 0 | 0 | 0 | 0 | 43 | 0 | 46 | MISC | 6.245005e-17 | 14.113380 | 2.454501 | 2.647123 | 11.640622 | 3.828641 | -1 |
| 3167 | ##ý | 0 | 0 | 0 | 45 | 0 | 0 | 0 | 0 | 45 | O | 6.938894e-18 | 14.882351 | 2.645751 | 2.700176 | 14.094030 | 3.806662 | -1 |
| 3166 | Brandenburg | 45 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 45 | LOC | 6.245005e-17 | 14.882351 | 2.645751 | 2.700176 | 14.094030 | 3.806662 | -1 |
| 3426 | lantern | 0 | 41 | 1 | 0 | 0 | 0 | 2 | 0 | 44 | O | 2.775558e-17 | 13.435029 | 2.442733 | 2.597865 | 11.504434 | 3.784190 | -1 |
| 3420 | Russians | 0 | 1 | 0 | 0 | 41 | 0 | 1 | 0 | 43 | MISC | 3.469447e-17 | 13.471614 | 2.506347 | 2.600585 | 12.260059 | 3.761200 | -1 |
| 3648 | spheres | 1 | 1 | 0 | 0 | 38 | 1 | 1 | 0 | 42 | O | 4.857226e-17 | 12.386989 | 2.359427 | 2.516647 | 10.584879 | 3.737670 | -1 |
| 3907 | ##words | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 35 | 41 | O | 0.000000e+00 | 11.329580 | 2.210650 | 2.427417 | 9.121641 | 3.713572 | -1 |
# Rough percentage of token types to be masked
print('number of token types masked: ', sum(f2['Mask'] == -1), '\n',
'approx proportion of token types masked: ', sum(f2['Mask'] == -1)/len(f2['Mask']))
number of token types masked: 220 approx proportion of token types masked: 0.009435176051807694
px.pie(f2,
values='Total',
names='NE',
title='Percentage of Token Types per NE group'
)
px.pie(f2,
values='Total',
names='Mask',
title='Percentage of Overall Tokens Masked'
)
px.pie(f2.loc[f2['Mask'] == -1],
values='Total',
names='NE',
title='Masked Tokens by NE Group'
)
px.pie(f2.loc[f2['Mask'] == 0],
values='Total',
names='NE',
title='Percentage of Unmasked Tokens by NE Group'
)
b = pd.DataFrame(f2.loc[f2['Mask'] == -1]['Target'].sum(axis=0), columns = ['Frequency'])
px.pie(b,
values='Frequency',
names=b.index,
title='Masked Tokens per Target Group'
)
The pie chart above illustrates that the DBSCAN mask does not solve the original problem and may actually increase the bias through masking. It is striking that 64 percent of the masked tokens appear in the Chinese samples, even though Chinese samples represent less than 30 percent of the corpus overall and their texts are not disproportionately longer than those of the other target groups. The reasons for this are beyond the scope of this research, but they might be worth investigating from another perspective, especially if the phenomenon is driven by named entities rather than topic imbalance.
c = pd.DataFrame(f2.loc[f2['Mask'] == 0]['Target'].sum(axis=0), columns = ['Frequency'])
px.pie(c,
values='Frequency',
names=c.index,
title='Unmasked Tokens by Target Group'
)
d = pd.DataFrame(f2['Target'].sum(axis=0), columns = ['Frequency'])
px.pie(d,
values='Frequency',
names=d.index,
title='Total Tokens by Target Group'
)
As the plot shows, the line of demarcation for the masked tokens follows the general formula CV_exp = b/(Total_log - c) + a, where a, b, and c are parameters to be optimized: a and c control the vertical and horizontal displacement of the curve, and b its sharpness.
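To make the demarcation concrete, the sketch below evaluates the curve for one illustrative parameter setting (the values of a, b, and c here are arbitrary stand-ins chosen to match the scale of the table above, not the optimized parameters):

```python
# illustrative parameter values only, not the optimized ones
a, b, c = 2.0, 1.0, 0.2

def boundary(total_log):
    # demarcation curve: the CV_exp threshold as a function of log total frequency
    return b / (total_log - c) + a

def is_masked(cv_exp, total_log):
    # mask token types whose dispersion across targets lies above the curve
    return cv_exp > boundary(total_log)

# a target-concentrated token type (CV near the sqrt(7) ceiling) falls above the curve
print(is_masked(2.6, 5.0))   # True -> masked
# a frequent, evenly dispersed token type falls below it
print(is_masked(1.7, 6.0))   # False -> kept
```

Because the curve rises as Total_log falls, rarer token types need a lower CV_exp to escape masking, which matches the intent of removing rare, target-specific tokens.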
To approximate optimal values for these three parameters, I use the algorithm below:
1. Generate evenly spaced combinations of all three parameters (a, b, c) across their appropriate ranges. Because the operations are computationally expensive, 4^3 = 64 parameter combinations are used at each step.
2. For each combination, compute the proposed mask and evaluate the following 'loss' function: Loss = mean_CV_unmasked + mean_rel_freq + CV_mask.
3. Locate the parameter combination that produced the minimum loss.
4. Generate 4^3 new values nested between the old values around this minimum.
5. Repeat step 2.
6. Continue this process until the new and old losses converge to within some threshold, say a ratio of new_loss/old_loss above 0.99.
def get_mean_CV(dataframe):
    # mean coefficient of variation across the token types in the frame
    mean_CV = dataframe['CV'].mean()
    return mean_CV

def get_mean_rel_freq(dataframe):
    # share of all corpus tokens accounted for by this subset of token types
    mean_rel_freq = dataframe['Total'].sum() / f2['Total'].sum()
    return mean_rel_freq

def get_CV(dataframe, token_name):
    # coefficient of variation (population sd / mean) of the summed
    # per-target frequencies stored in the given column
    totals = dataframe[token_name].sum(axis=0)
    CV = np.std(totals) / np.mean(totals)
    return CV
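As a sanity check on the CV computation: with eight target groups, a token type occurring in only one group reaches the population-CV ceiling of sqrt(7) ≈ 2.6458, which matches the maximum CV values in the table above (e.g. 'Munich', 'HK'). A minimal sketch, assuming the CV column uses the population standard deviation (ddof=0):

```python
import numpy as np

def coeff_var(counts):
    # population coefficient of variation: sd (ddof=0) / mean
    counts = np.asarray(counts, dtype=float)
    return np.std(counts) / np.mean(counts)

# a token type concentrated in a single one of the eight target groups,
# cf. 'Munich' (97, 0, 0, 0, 0, 0, 0, 0) in the table above
print(coeff_var([97, 0, 0, 0, 0, 0, 0, 0]))   # 2.6457... = sqrt(7)
# a token type spread perfectly evenly across all eight groups
print(coeff_var([12] * 8))                     # 0.0
```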
# initial test values
a = np.linspace(f2['CV_exp'].min(), f2['CV_exp'].max(), num=4)
c = np.linspace(f2['Total_log'].min(), f2['Total_log'].max(), num=4)
b = np.linspace(0.01, 3.01, num=4)
#loss
def get_mask(a, b, c):
    # demarcation curve: token types whose CV_exp lies above
    # b/(Total_log - c) + a are masked (0); the rest are kept (1)
    val = b / (f2['Total_log'] - c) + a
    mask = pd.Series(np.where(f2['CV_exp'] > val, 0, 1),
                     index=f2.index, name='Mask')
    return mask
def get_loss(mask):
    # loss = mean CV of the unmasked token types
    #        + proportion of all tokens that were masked
    #        + CV of the masked tokens across target groups
    unmasked_cv = f2['CV'].loc[mask == 1]
    unm_cv_mean = np.mean(unmasked_cv)
    masked_rel_freq = f2['Total'].loc[mask == 0].sum() / f2['Total'].sum()
    freq_mask = f2.loc[mask == 0]
    cv_mask = get_CV(freq_mask, 'Target')
    loss = unm_cv_mean + masked_rel_freq + cv_mask
    return loss
def get_test_vals(a, b, c):
    # evaluate the loss for every combination of the candidate parameter values
    rows = []
    for i in a:
        for j in b:
            for k in c:
                mask = get_mask(i, j, k)
                rows.append({'a': i, 'b': j, 'c': k, 'loss': get_loss(mask)})
    losses = pd.DataFrame(rows, columns=['a', 'b', 'c', 'loss'])
    return losses
loss_matrix.to_csv("mask_loss_matrix.csv")
The idea is to run a few iterations of this, zooming into the areas thought to contain the minimum loss. The code still needs cleaning up, which I will do in the next step of the project. The rest of the code below can be ignored for now, since it is only formative.
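The zoom-in search can be sketched generically. Below, a toy quadratic loss stands in for get_loss(get_mask(a, b, c)) so the sketch is self-contained; toy_loss, grid_search, and refine are illustrative names, not project code:

```python
import itertools
import numpy as np

def toy_loss(a, b, c):
    # stand-in for get_loss(get_mask(a, b, c)); true minimum at (1.1, 0.6, 1.9)
    return (a - 1.1) ** 2 + (b - 0.6) ** 2 + (c - 1.9) ** 2

def grid_search(a_vals, b_vals, c_vals):
    # evaluate the loss over all 4**3 = 64 parameter combinations
    combos = list(itertools.product(a_vals, b_vals, c_vals))
    losses = [toy_loss(*combo) for combo in combos]
    i = int(np.argmin(losses))
    return combos[i], losses[i]

def refine(lo, hi, steps=10, ratio=0.99):
    # repeatedly zoom a 4x4x4 grid in around the current best combination
    ranges = [np.linspace(l, h, num=4) for l, h in zip(lo, hi)]
    best, old_loss = grid_search(*ranges)
    for _ in range(steps):
        # build a tighter grid, nested around the current best values
        spans = [r[1] - r[0] for r in ranges]
        ranges = [np.linspace(v - s, v + s, num=4)
                  for v, s in zip(best, spans)]
        best, new_loss = grid_search(*ranges)
        if old_loss > 0 and new_loss / old_loss > ratio:
            break  # losses have stopped improving past the threshold
        old_loss = new_loss
    return best, new_loss

best, loss = refine(lo=(0, 0, 0), hi=(3, 3, 3))
```

With the actual loss, grid_search would correspond to get_test_vals and the zoom step to find_local_min.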
def find_local_min(a, b, c):
    # return the (a, b, c) combination that produced the minimum loss
    losses = get_test_vals(a, b, c)
    best = losses.loc[losses['loss'].idxmin()]
    return best['a'], best['b'], best['c']
loss_matrix = get_test_vals(a,b,c)
l = get_test_vals([1, 0.3], [0.5, 0.8], [2, 0.3])
min_ind = np.where(l['loss']==l['loss'].min())
c = [2, 0.3]
c[c==l.loc[min_ind, 'c']]
ValueError: ('Lengths must match to compare', (1,), (2,))
px.scatter_3d(loss_matrix, x='a', y='b', z='c', color='loss')
loss_matrix.loc[loss_matrix['loss'].idxmin()]
Brezina, V. (2018). Vocabulary: Frequency, Dispersion and Diversity. In Statistics in Corpus Linguistics: A Practical Guide (pp. 38-65). Cambridge: Cambridge University Press. doi:10.1017/9781316410899.003
University of Pittsburgh English Language Institute Corpus (PELIC). (2022). https://eli-data-mining-group.github.io/Pitt-ELI-Corpus/
Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). Selected Proceedings of the 31st Second Language Research Forum (SLRF). Cascadilla Press, MA.